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ABSTRACT 

A new  radix-2  two-dimensional  direct  FFT 
developed  by  Rivard  is  generalized  in  this  paper 
to  include  arbitrary  radices  and  non-square  arrays. 
It  is  shown  that  the  radix-4  version  of  this  algo- 
rithm may  require  significantly  fewer  computations 
than  conventional  row-column  transform  methods. 
Also,  the  new  algorithm  eliminates  the  matrix 
transpose  operation  normally  required  when  the 
array  must  reside  on  a bulk  storage  device.  It 
requires  the  same  number  of  passes  over  the  array 
on  bulk  storage  as  efficient  matrix  transpose 
routines,  but  produces  the  transform  in  bit- 
reversed  order.  An  additional  pass  over  the  data 
is  necessary  to  sort  the  array  if  normal  ordering 
is  desired. 


INTRODUCTION 

Since  the  appearance  of  the  original  Cooley- 
Tukey  algorithm  in  1965,  the  standard  methods  of 
computing  the  two-dimensional  (2-D)  discrete 
Fourier  transform  (DFT)  of  an  array  have  capital- 
ized on  the  separability  of  the  2-D  DFT  [1].  Using 
a 1-D  FFT  algorithm,  row-wise  and  columnwise,  1-D 
DFT's  can  be  computed  to  yield  the  2-D  transform. 
This  scheme  amounts  to  decimating  and  transforming 
the  array  first  in  one  index  and  then  in  the  other. 

A now  algorithm  which  performs  the  decimation 
in  both  indices  simultaneously  has  been  derived  by 
Rivard  using  a holor  algebra  formalism  12).  Rivard 
demonstrated  that  his  radix-2  direct  2-D  FFT  elim- 
inates 25%  of  the  multiplies  required  by  the  con- 
ventional row-column  approach. 

The  purpose  of  this  paper  is  to  present  an 
alternate  derivation  of  the  new  algorithm  and  to 
extend  it  to  rectangular  arrays  and  arbitrary 
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radices.  Further,  we  show  that  even  larger 
savings  in  multiplies  are  obtained  when  the  algo- 
rithm is  generalized  to  operate  with  larger 
radices  and  on  higher  dimensional  arrays.  We 
refer  to  the  general  algorithm  as  the  vector 
radix  algorithm,  since  to  specify  the  decimation 
of  the  array,  multiple  radices  are  required,  one 
for  each  index  of  the  array. 

An  additional,  machine  dependent,  implication 
of  the  new  algorithm  is  explored  here  as  well. 

When  the  DFT  computation  is  implemented  on  a 
computer  with  an  insufficient  amount  of  core 
memory  to  contain  the  entire  array,  a matrix 
transpose  operation  is  a necessary  component  of 
the  row-column  approach.  This  fact  has  occasioned 
a literature  on  fast  transpose  techniques,  led  by 
Eklundh  [3],  However,  the  vector  radix  algorithm 
requires  no  transpose.  The  transpose  is,  in 
effect,  incorporated  into  the  transform.  A vector 
radix  transform  requires  the  same  number  of  passes 
over  the  array  on  secondary  storage  (e.g.  disk) 
as  the  Eklundh  transpose  algorithm,  when  the 
resulting  DFT  can  be  tolerated  in  bit  reversed 
order.  If  the  DFT  must  be  in  correct  order,  an 
additional  transfer  of  the  array  is  required  to 
perform  bit  reversed  sorting.  On  a machine  which 
is  I/O  intensive,  this  extra  pass  over  the  array 
may  compromise  the  computational  advantages  of 
the  new  algorithm. 

DERIVATION 

As  with  the  one-dimensional  FFT  algorithm, 
the  new  direct  two-dimensional  FFT  is  derived  by 
decomposing  the  DFT  into  sums  of  smaller  DFT's 
multiplied  by  "twiddle"  factors.  We  derive  here 
a single  stage  of  the  general  vector  radix  algo- 
rithm for  the  decimation  in  time  case.  This  is 
all  that  is  necessary,  since  the  complete  algo- 
rithm is  obtained  by  recursive  application  of  this 
basic  decomposition. 

We  suppose  the  2-D  DFT 

M-l  N-l  0 

X(k,fc)  * l l x [m,nl  w m w n 
n-0  n-0  M N 

k=0,...,M-l  «=0,...,N-1  (1) 

of  the  MxN  array  x(m,n]  is  desired,  where 
W = exp(-j2ir/M) 

M 
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Provided  radix  r,  divides  M and  radix  r divides  N: 
1 2 

M/r^  = P N/r^  = Q integers,  (2) 

the  DFT  may  be  computed  with  an  r^xr^  stage.  Using 
the  change  of  variables: 


m = r^p+u 


n=r2q+v 


uv 

XpQ(a+bP,  c+dQ) 


uv 

XpQ(afO 


Using  substitutions  (9)  and  10)  in  equation  (8) 
produces:  ■ , . 

1 2~l 


?,  c+dQ)  = l l W^U  W^V 


p = 0,...,P-1  q=0,...,Q-l 


u = 0,...,^-!  v=0,...,r2-l 


the  DFT  (1)  is  decomposed  into  r^r2  PxQ  DFT's. 
Performing  the  substitution  of  variables  of  (3) 
in  equation  (1) , we  obtain: 

P-1  rl_1  Q-l  r2_1 

X(k,5.)  =1111  x[r  p+u,r  q+v]  • 
p=0  u=0  q=0  v=0 

k(r  p+u)  i (r  q+v) 

WM  - «N  (4) 

From  equation  (2) : 

»*“  , »*" V”  . . (5) 

M P M N Q N 

Substituting  the  relations  of  (5)  into  equation (4) , 
we  obtain  the  desired  result: 


V1  V1 


X(k,*>  = l l w-  f x£ 


(k , «.)  (6) 


P-1  Q-l  . „ 

£».*>  = I l xlr  P+u,r  q+v]  //  W q 
y p=0  q=0  “ 

is  the  2-d  DFT  of  a PxQ  element  array. 

We  introduce  a second  change  of  variables  to 
split  the  computation  of  X(k,il)  into  r^xr2  element 
butterfly  groupings: 

k = a+bP  H = c+dQ 

a = 0,...,P-1  c = 0,...,Q-1  (7) 


b = 0,...,r1-l  d = 0,...,r2-l 


Performing  the  change  of  variables  (7)  on  equation 
(6)  yields:  r x 

X(a+bP,  c+dQ)  = l f „ (a+bP) u w<c+dQ)v 
u-0  V”0  M N 


XpQ  (a+bP,  c+dQ)  . 


w (a+bP) u . waUwbu  (c+dQ)v  . „cv  dv 
M M r ^ N N r2 


[(WMU  ^ XPQU'C)1-  (11) 

Equation  (11)  defines  the  basic  structure  of  the 
r^xr2  butterfly.  A few  observations  will  clarify 

this.  First,  there  is  one  butterfly  for  each 
combination  of  values  of  a and  c,  giving  a total 
of  PQ  butterflies.  Thus,  a and  c are  the  butter- 
fly indices.  Each  butterfly  computes  r^r^  values 

of  X ( • , • ) (indexed  by  b and  d)  from  ^1r2  values 

of  the  smaller  PxQ  element  DFT's  (indexed  by  u and 
v) . As  input  to  the  butterfly,  one  element  is 
taken  from  each  PxQ  point  transform.  These  ele- 
ments are  not  used  by  any  other  butterflies  in 
the  stage,  so  that  the  r^r2  output  values  of 

the  butterfly  may  be  stored  in  the  memory  loca- 
tions occupied  by  the  input  elements.  Therefore, 
as  with  the  1-D  FFT  algorithm,  the  vector  radix 
transform  may  be  computed  in  place. 

The  inputs  to  the  butterfly  are  premultiplied 
by  the  twiddle  factors  (in  parentheses) , then  an 
r^xr^  2-D  DFT  of  the  resulting  array  (in  square 

brackets)  is  computed.  The  fact  that  the  butter- 
fly can  be  interpreted  as  a 2-D  DFT  calculation 
suggests  several  structures  for  performing  the 
computation.  The  conventional  row-column  FFT 
structure  can  be  employed  or  the  vector  radix 
approach  can  be  used.  The  latter  is  generally 
more  efficient. 

One  very  important  observation  is  that  the 

product  of  the  twiddle  factors  wau  and  WCV  may  be 

M N 

precomputed  and  stored  in  a table.  The  computa- 
tional advantage  enjoyed  by  the  vector  radix  al- 
gorithm arises  from  the  fact  that  such  products 
are  not  computed,  whereas  they  are  computed  impli- 
citly in  the  row-column  approach. 

As  an  example  wo  consider  the  radix  2x2  case, 
V 

where  M=N=2  , v an  integer,  and  ^ = ^=2.  Equation 
(11)  reduces  to: 

X(a+b  f , c+d  i ) = 1 l (-l,bu+dv 

u=0  v=0 


“«rcv>  < Nv 


2 2 

a,c=0, . . . ,N/2-l 


b ,d=0 , 1 


The  PxQ  point  DFT's  are  periodic: 


One  possible  butterfly  structure  for  the 
computation  of  equation  (12)  is  shown  in  Figure  1. 
This  structure  has  a minimum  number  of  adds  and 
multiplies,  and  is  the  same  as  an  efficient  radix 


four  one-dimensional  butterfly  structure  with  the 
internal  multiplies  by  j removed. 


1 


2 2 

Figure  It  Radix  2x2  Butterfly 


X[0,0] 

X[0,1] 

X[0,2] 

X[0,3] 

X[1,0] 
X[l,l] 
X[l, 2] 
XU, 31 

X[2,0] 
X [2 , 1 ] 
X[2 ,21 
X [2 , 3J 

X [3 ,0] 
X(3,l] 
X [3 , 2 J 
X [3 , 3] 


Figure  2:  4x4  DFT  with  radix  2x2  algorithm. 

Solid  circles  indicate  2x2  butterflies  and 

twiddle  factor  multiplies  by  are  indicated 
by  the  exponents  k. 


some  flexibility  in  trading  CPU  speed  against  I/O 
transfer  rate.  In  an  attempt  to  quantify  this 
tradeoff,  statistics  for  the  two  algorithms  are 
presented  in  a table  at  the  end  of  this  section. 
The  total  number  of  complex  multiplies  and  butter- 
flies required  for  NxN  transforms  are  tabulated 
for  both  algorithms  and  various  radices.  The 
number  of  butterflies  is  included  as  a partial 
measure  ot  the  amount  of  "overhead"  that  the 
algorithms  might  be  expected  to  have.. 

In  counting  the  number  of  multiplies 
we  adopt  the  philosophy  that  all  butterflies  are 
treated  identically.  This  implies  that  all 
twiddle  factor  multiplies  are  counted  for  the 
second  and  subsequent  stages.  Twiddle  factors 
are  unity  in  the  first  stage,  so  these  multiplies 
are  not  counted.  Multiplies  by  ±1  and  ± j internal 
to  the  butterflies  are  not  counted. 

As  a figure  of  relative  computational 
speed,  the  ratio  of  the  total  number  of  multiplies 
required  by  each  algorithm  to  the  total  number  of 
multiplies  required  by  the  radix  2 row-column 
algorithm  (for  N=4096)  is  also  tabulated. 

The  left-hand  side  of  the  table  contains 
information  on  the  number  of  times  that  the  entire 
array  must  be  transfered  from  bulk  memory  to  core 
memory  and  back.  This  information  is  presented  as 
a function  of  the  number  of  rows  of  the  array  (2, 
4,8,16)  that  can  be  stored  in  core  memory.  It  is 
assumed  that  the  Eklundh  transpose  algorithm  is 
used  in  the  row-column  approach.  This  requires 
log^/log^R  transfers,  where  R is  the  available 

number  of  rows  of  core  storage  and  we  assume  that 
N is  a power  of  R.  No  extra  transfers  are  neces- 
sary to  compute  the  row  and  column  FFT's,  since 
these  may  be  computed  during  the  first  and  last 
stages  of  the  Eklundh  algorithm.  The  vector  radix 
algorithm  requires  log^N/log^R  +1  passes  over  the 

array,  since  (in  the  radix  2x2  case)  log2R  stages 

of  the  FFT  may  be  computed  per  array  transfer. 

The  extra  pass  over  the  array  in  the  vector  radix 
algorithm  is  necessitated  by  the  2-D  bit-reversed 
sorting  of  the  data. 

It  is  readily  apparent  that  the  vector 
radix  rxr  algorithm  has  fewer  multiplies  and  but- 
terflies than  the  row-column  radix  r algorithm. 

In  the  case  where  r=2,  there  is  a 25%  reduction 
in  number  of  multiplies  and  fewer,  albeit  more 
complicated,  butterflies  with  the  2x2  algorithm. 
Experiments  with  FORTRAN  codings  of  the  two  algo- 
rithms have  shown  that  a 25%  reduction  in  computa- 
tion time  is  achieved  in  practice  for  N=64. 


In  Figure  2 we  display  the  radix  2x2  structure 
for  computation  of  a 4x4  (v=2)  element  DFT.  Rows 
of  the  bit-reversed  array  are  concatenated  and 
arranged  vertically  on  the  left  hand  side  of  the 
figure.  As  is  generally  true  for  the  2x2  algorithm, 
only  2 rows  of  the  array  are  accessed  simultaneous- 
ly, so  that  2 rows  worth  of  core  storage  are  suf- 
ficient to  implement  the  algorithm. 

COMPARISON  AND  DISCUSSION 


On  the  basis  of  relative  number  of  multi- 
plies and  butterflies,  we  conclude  that  the  radix 
4 row-column  algorithm  has  about  the  same  computa- 
tional complexity  as  the  vector  radix  2x2  algori- 
thm. The  fact  that  the  radix  4 1-D  algorithm  re- 
quires one  less  array  transfer  argues  strongly  in 
its  favor.  Its  disadvantages  are  less  flexibility 
in  transform  size  (N  must  be  a power  of  4 as 
opposed  to  a power  of  2)  and  the  fact  that  it 
produces  the  transpose  of  the  DFT  instead  of  the 
normally-ordered  DFT.  This  last,  rather  minor, 


Introduction  of  the  vector  radix  FFT  provides 


problem  is  characteristic  of  all  radices  in  the 
row-column  approach. 


The  larger  radix  algorithms  are  even  more 
attractive  as  far  as  the  relative  numbers  of 
multiplies  and  butterflies  are  concerned.  The 
radix  4x4  and  8x8  algorithms  appear  to  be  compu- 
tationally advantageous  even  when  compared  with 
the  1-D  radix  4,  8 and  16  algorithms.  However, 
the  coding  complexity  will  increase  dramatically 
with  increases  in  the  radix  size  for  the  vector 
radix  approach.  The  rather  marginal  decrease  in 
multiplies  in  the  radix  8x8  case  as  compared  to 
the  radix  4x4  case  probably  does  not  justify  the 
extra  coding  effort.  With  these  facts  in  mind, 
we  speculate  that  the  radix  4x4  algorithm  will  be 
the  algorithm  of  choice  for  most  applications 
involving  machines  limited  by  computational  speed 
or  machines  with  sufficient  core  storage  to  hold 
the  entire  array. 


for  d-dimensional  transforms.  For  large  values 
of  d,  this  ratio  asymptotically  approaches  2/d. 
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Transforms  of  non-square  arrays  can  be  accomo- 
dated in  the  vector  radix  approach  by  means  of 
mixed  radix  algorithms.  For  example,  Nx2N  trans- 
forms may  be  computed  with  one  stage  of  radix  2x4 
butterflies,  followed  by  log  N-l  stages  of  radix 


In  conclusion,  we  note  that  the  vector  radix 
algorithm  can  be  generalized  to  higher  dimensional 
transforms.  In  three  dimensions,  decimation  in  all 
three  indices  of  the  array  simultaneously  leads  to 
a radix  2x2x2  algorithm.  This  algorithm  requires 
7/12  the  number  of  multiplies  that  a conventional 
algorithm  using  1-D  FFT's  requires.  This  ratio 
improves  with  higher  dimensionality  and  is  equal  to 


TABLE  OF  OPERATION  COUNTS  (N=2  ) 

Disk  Accesses  Multiplies  per  Number  of  Total  Number 

FFT  TYPE  Radix  2 4 8 16  Butterfly*  Butterflies  of  Multiplies 

VR  2x2  V+l  v/2+1  v/3+1  V/4+1  3+0  1/4  N2V  3/4  N2(V-1) 

VR  4x4  - V/2+1  V/2+1  V/4+1  15+0  1/32  N2V  15/32  N2(V-2) 

VR  8x8  - - V/3+1  V/3+1  63+24  1/192  N2V  29/64  N2(V-63/29) 

RC  2 V V/2  v/3  V/4  1+0  N2V  N2(v-1) 

RC  4 v V/2  V/3  V/4  3+0  1/4  NV  3/4  N2(V-2) 

2 2 

RC  8 v V/2  V/3  V/4  7+2  1/12  N V 3/4  N (V-7/3) 

2 2 

RC  16  v V/2  V/3  \>/4  15+9  1/32  N V 3/4  N (V-  5/2) 


Twiddle  factor  multiplies  (first)  and  internal  butterfly  multiplies  (second)  are  tabulated  separately 


